Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery. [http://machinelearningmastery.com/]

SUMMARY: [Sample Paragraph - The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Connectionist Bench dataset is a binary classification problem where we try to predict one of two possible outcomes.]

INTRODUCTION: [Sample Paragraph - The data file contains patterns obtained by bouncing sonar signals off a metal cylinder or a rock at various angles and under various conditions. The transmitted sonar signal is a frequency-modulated chirp, rising in frequency. The data set contains signals obtained from a variety of different aspect angles, spanning 90 degrees for the cylinder and 180 degrees for the rock. Each pattern is a set of 60 numbers in the range 0.0 to 1.0. Each number represents the energy within a particular frequency band, integrated over a certain period of time.]

ANALYSIS: [Sample Paragraph - The baseline performance of the machine learning algorithms achieved an average accuracy of 77.62%. Two algorithms (Random Forest and Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Gradient Boosting turned in the top overall result and achieved an accuracy metric of 80.85%. After applying the optimized parameters, the Gradient Boosting algorithm processed the testing dataset with an accuracy of 80.65%, which was slightly below the prediction accuracy gained from the training data.]

CONCLUSION: [Sample Paragraph - For this iteration, the Gradient Boosting algorithm achieved the best overall training and validation results. For this dataset, the Gradient Boosting algorithm could be considered for further modeling.]

Dataset Used: [Connectionist Bench (Sonar, Mines vs. Rocks) Data Set]

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: [https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+%28Sonar%2C+Mines+vs.+Rocks%29]

One potential source of performance benchmarks: [https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+%28Sonar%2C+Mines+vs.+Rocks%29]

The project aims to touch on the following areas:

  1. Document a predictive modeling problem end-to-end.
  2. Explore data cleaning and transformation options.
  3. Explore non-ensemble and ensemble algorithms for baseline model performance.
  4. Explore algorithm tuning techniques for improving model performance.

Any predictive modeling machine learning project generally can be broken down into six major tasks:

  1. Prepare Problem
  2. Summarize Data
  3. Prepare Data
  4. Model and Evaluate Algorithms
  5. Improve Accuracy or Results
  6. Finalize Model and Present Results

1. Prepare Problem

1.a) Load libraries

startTimeScript <- proc.time()
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
library(corrplot)
## corrplot 0.84 loaded
library(DMwR)
## Loading required package: grid
## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(Hmisc)
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(mailR)
## Registered S3 method overwritten by 'R.oo':
##   method        from       
##   throw.default R.methodsS3
library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
library(stringr)

# Create one random seed number for reproducible results
seedNum <- 888
set.seed(seedNum)
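As a quick illustration of why the fixed seed matters, re-seeding before a random draw reproduces the draw exactly. This is a throwaway sketch; the vectors here are not used elsewhere in the script.

```r
# Two sampling runs from the same seed yield identical results
set.seed(888)
first_draw <- sample(1:10)
set.seed(888)
second_draw <- sample(1:10)
identical(first_draw, second_draw)  # TRUE
```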

1.b) Set up the email notification function

email_notify <- function(msg=""){
  sender <- Sys.getenv("MAIL_SENDER")
  receiver <- Sys.getenv("MAIL_RECEIVER")
  gateway <- Sys.getenv("SMTP_GATEWAY")
  smtpuser <- Sys.getenv("SMTP_USERNAME")
  password <- Sys.getenv("SMTP_PASSWORD")
  sbj_line <- "Notification from R Binary Classification Script"
  send.mail(
    from = sender,
    to = receiver,
    subject= sbj_line,
    body = msg,
    smtp = list(host.name = gateway, port = 587, user.name = smtpuser, passwd = password, ssl = TRUE),
    authenticate = TRUE,
    send = TRUE)
}
# Set up the muteEmail flag to control progress emails (TRUE suppresses them; FALSE sends them)
muteEmail <- FALSE
if (!muteEmail) email_notify(paste("Library and Data Loading has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@5a39699c}"
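A transient SMTP failure would otherwise abort the whole run, so the notification call can be wrapped in tryCatch. This is a minimal sketch, not part of the template: `safe_notify` is a hypothetical helper name, and it assumes the `email_notify` function defined above.

```r
# Hypothetical wrapper: swallow email errors so the analysis keeps running
safe_notify <- function(msg = "") {
  tryCatch(
    email_notify(msg),
    error = function(e) message("Email notification failed: ", conditionMessage(e))
  )
}

# Usage would mirror the existing calls, e.g.:
# if (!muteEmail) safe_notify(paste("Library and Data Loading has begun!", date()))
```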

1.c) Load dataset

# Slicing up the document path to get the final destination file name
dataset_path <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data'
doc_path_list <- str_split(dataset_path, "/")
dest_file <- doc_path_list[[1]][length(doc_path_list[[1]])]
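As an aside, base R's basename() extracts the same final path segment without manual string splitting. This is just an equivalent sketch, not a change to the pipeline above.

```r
# Equivalent one-liner using base R's path utilities
dataset_path <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data'
dest_file <- basename(dataset_path)
# dest_file is now "sonar.all-data"
```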

if (!file.exists(dest_file)) {
  # Download the document from the website
  cat("Downloading", dataset_path, "as", dest_file, "\n")
  download.file(dataset_path, dest_file, mode = "wb")
  cat(dest_file, "downloaded!\n")
#  unzip(dest_file)
#  cat(dest_file, "unpacked!\n")
}

inputFile <- dest_file
colNames <- paste0("attr",1:60)
colNames <- c(colNames, 'targetVar')
xy_original <- read.csv(inputFile, sep=',', header=FALSE, col.names = colNames)

# Different ways of reading and processing the input dataset. Saving these for future reference.
#x_train <- read.fwf("X_train.txt", widths = widthVector, col.names = colNames)
#y_train <- read.csv("y_train.txt", header = FALSE, col.names = c("targetVar"))
#y_train$targetVar <- as.factor(y_train$targetVar)
#xy_train <- cbind(x_train, y_train)
# Take a peek at the dataframe after the import
head(xy_original)
##    attr1  attr2  attr3  attr4  attr5  attr6  attr7  attr8  attr9 attr10
## 1 0.0200 0.0371 0.0428 0.0207 0.0954 0.0986 0.1539 0.1601 0.3109 0.2111
## 2 0.0453 0.0523 0.0843 0.0689 0.1183 0.2583 0.2156 0.3481 0.3337 0.2872
## 3 0.0262 0.0582 0.1099 0.1083 0.0974 0.2280 0.2431 0.3771 0.5598 0.6194
## 4 0.0100 0.0171 0.0623 0.0205 0.0205 0.0368 0.1098 0.1276 0.0598 0.1264
## 5 0.0762 0.0666 0.0481 0.0394 0.0590 0.0649 0.1209 0.2467 0.3564 0.4459
## 6 0.0286 0.0453 0.0277 0.0174 0.0384 0.0990 0.1201 0.1833 0.2105 0.3039
##   attr11 attr12 attr13 attr14 attr15 attr16 attr17 attr18 attr19 attr20
## 1 0.1609 0.1582 0.2238 0.0645 0.0660 0.2273 0.3100 0.2999 0.5078 0.4797
## 2 0.4918 0.6552 0.6919 0.7797 0.7464 0.9444 1.0000 0.8874 0.8024 0.7818
## 3 0.6333 0.7060 0.5544 0.5320 0.6479 0.6931 0.6759 0.7551 0.8929 0.8619
## 4 0.0881 0.1992 0.0184 0.2261 0.1729 0.2131 0.0693 0.2281 0.4060 0.3973
## 5 0.4152 0.3952 0.4256 0.4135 0.4528 0.5326 0.7306 0.6193 0.2032 0.4636
## 6 0.2988 0.4250 0.6343 0.8198 1.0000 0.9988 0.9508 0.9025 0.7234 0.5122
##   attr21 attr22 attr23 attr24 attr25 attr26 attr27 attr28 attr29 attr30
## 1 0.5783 0.5071 0.4328 0.5550 0.6711 0.6415 0.7104 0.8080 0.6791 0.3857
## 2 0.5212 0.4052 0.3957 0.3914 0.3250 0.3200 0.3271 0.2767 0.4423 0.2028
## 3 0.7974 0.6737 0.4293 0.3648 0.5331 0.2413 0.5070 0.8533 0.6036 0.8514
## 4 0.2741 0.3690 0.5556 0.4846 0.3140 0.5334 0.5256 0.2520 0.2090 0.3559
## 5 0.4148 0.4292 0.5730 0.5399 0.3161 0.2285 0.6995 1.0000 0.7262 0.4724
## 6 0.2074 0.3985 0.5890 0.2872 0.2043 0.5782 0.5389 0.3750 0.3411 0.5067
##   attr31 attr32 attr33 attr34 attr35 attr36 attr37 attr38 attr39 attr40
## 1 0.1307 0.2604 0.5121 0.7547 0.8537 0.8507 0.6692 0.6097 0.4943 0.2744
## 2 0.3788 0.2947 0.1984 0.2341 0.1306 0.4182 0.3835 0.1057 0.1840 0.1970
## 3 0.8512 0.5045 0.1862 0.2709 0.4232 0.3043 0.6116 0.6756 0.5375 0.4719
## 4 0.6260 0.7340 0.6120 0.3497 0.3953 0.3012 0.5408 0.8814 0.9857 0.9167
## 5 0.5103 0.5459 0.2881 0.0981 0.1951 0.4181 0.4604 0.3217 0.2828 0.2430
## 6 0.5580 0.4778 0.3299 0.2198 0.1407 0.2856 0.3807 0.4158 0.4054 0.3296
##   attr41 attr42 attr43 attr44 attr45 attr46 attr47 attr48 attr49 attr50
## 1 0.0510 0.2834 0.2825 0.4256 0.2641 0.1386 0.1051 0.1343 0.0383 0.0324
## 2 0.1674 0.0583 0.1401 0.1628 0.0621 0.0203 0.0530 0.0742 0.0409 0.0061
## 3 0.4647 0.2587 0.2129 0.2222 0.2111 0.0176 0.1348 0.0744 0.0130 0.0106
## 4 0.6121 0.5006 0.3210 0.3202 0.4295 0.3654 0.2655 0.1576 0.0681 0.0294
## 5 0.1979 0.2444 0.1847 0.0841 0.0692 0.0528 0.0357 0.0085 0.0230 0.0046
## 6 0.2707 0.2650 0.0723 0.1238 0.1192 0.1089 0.0623 0.0494 0.0264 0.0081
##   attr51 attr52 attr53 attr54 attr55 attr56 attr57 attr58 attr59 attr60
## 1 0.0232 0.0027 0.0065 0.0159 0.0072 0.0167 0.0180 0.0084 0.0090 0.0032
## 2 0.0125 0.0084 0.0089 0.0048 0.0094 0.0191 0.0140 0.0049 0.0052 0.0044
## 3 0.0033 0.0232 0.0166 0.0095 0.0180 0.0244 0.0316 0.0164 0.0095 0.0078
## 4 0.0241 0.0121 0.0036 0.0150 0.0085 0.0073 0.0050 0.0044 0.0040 0.0117
## 5 0.0156 0.0031 0.0054 0.0105 0.0110 0.0015 0.0072 0.0048 0.0107 0.0094
## 6 0.0104 0.0045 0.0014 0.0038 0.0013 0.0089 0.0057 0.0027 0.0051 0.0062
##   targetVar
## 1         R
## 2         R
## 3         R
## 4         R
## 5         R
## 6         R
sapply(xy_original, class)
##     attr1     attr2     attr3     attr4     attr5     attr6     attr7 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##     attr8     attr9    attr10    attr11    attr12    attr13    attr14 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr15    attr16    attr17    attr18    attr19    attr20    attr21 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr22    attr23    attr24    attr25    attr26    attr27    attr28 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr29    attr30    attr31    attr32    attr33    attr34    attr35 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr36    attr37    attr38    attr39    attr40    attr41    attr42 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr43    attr44    attr45    attr46    attr47    attr48    attr49 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr50    attr51    attr52    attr53    attr54    attr55    attr56 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr57    attr58    attr59    attr60 targetVar 
## "numeric" "numeric" "numeric" "numeric"  "factor"
sapply(xy_original, function(x) sum(is.na(x)))
##     attr1     attr2     attr3     attr4     attr5     attr6     attr7 
##         0         0         0         0         0         0         0 
##     attr8     attr9    attr10    attr11    attr12    attr13    attr14 
##         0         0         0         0         0         0         0 
##    attr15    attr16    attr17    attr18    attr19    attr20    attr21 
##         0         0         0         0         0         0         0 
##    attr22    attr23    attr24    attr25    attr26    attr27    attr28 
##         0         0         0         0         0         0         0 
##    attr29    attr30    attr31    attr32    attr33    attr34    attr35 
##         0         0         0         0         0         0         0 
##    attr36    attr37    attr38    attr39    attr40    attr41    attr42 
##         0         0         0         0         0         0         0 
##    attr43    attr44    attr45    attr46    attr47    attr48    attr49 
##         0         0         0         0         0         0         0 
##    attr50    attr51    attr52    attr53    attr54    attr55    attr56 
##         0         0         0         0         0         0         0 
##    attr57    attr58    attr59    attr60 targetVar 
##         0         0         0         0         0

1.d) Data Cleaning

# Not applicable for this iteration of the project
# Sample code for performing basic data cleaning tasks

# Dropping features
# xy_original$column_name <- NULL

# Mark missing values
# invalid <- 0
# xy_original$column_name[xy_original$column_name==invalid] <- NA

# Impute missing values
# column_median <- median(xy_original$column_name, na.rm = TRUE)
# xy_original$column_name[xy_original$column_name==0] <- column_median
# xy_original$column_name <- with(xy_original, impute(column_name, column_median))

# Convert columns from one data type to another
# xy_original$column_name <- as.integer(xy_original$column_name)
# xy_original$column_name <- as.factor(xy_original$column_name)
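To make the commented imputation pattern concrete, here is a small self-contained sketch on a toy data frame, where zero is treated as the invalid marker. The column name and values are illustrative only and do not come from the sonar dataset.

```r
# Toy data frame with an invalid-value marker of 0
df <- data.frame(column_name = c(1.2, 0, 3.4, 2.8, 0))

# Mark invalid values as NA, then impute with the median of the valid entries
df$column_name[df$column_name == 0] <- NA
column_median <- median(df$column_name, na.rm = TRUE)
df$column_name[is.na(df$column_name)] <- column_median

# All five values are now valid; the two former zeros equal the median (2.8)
```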
# Take a peek at the dataframe after the cleaning
head(xy_original)
##    attr1  attr2  attr3  attr4  attr5  attr6  attr7  attr8  attr9 attr10
## 1 0.0200 0.0371 0.0428 0.0207 0.0954 0.0986 0.1539 0.1601 0.3109 0.2111
## 2 0.0453 0.0523 0.0843 0.0689 0.1183 0.2583 0.2156 0.3481 0.3337 0.2872
## 3 0.0262 0.0582 0.1099 0.1083 0.0974 0.2280 0.2431 0.3771 0.5598 0.6194
## 4 0.0100 0.0171 0.0623 0.0205 0.0205 0.0368 0.1098 0.1276 0.0598 0.1264
## 5 0.0762 0.0666 0.0481 0.0394 0.0590 0.0649 0.1209 0.2467 0.3564 0.4459
## 6 0.0286 0.0453 0.0277 0.0174 0.0384 0.0990 0.1201 0.1833 0.2105 0.3039
##   attr11 attr12 attr13 attr14 attr15 attr16 attr17 attr18 attr19 attr20
## 1 0.1609 0.1582 0.2238 0.0645 0.0660 0.2273 0.3100 0.2999 0.5078 0.4797
## 2 0.4918 0.6552 0.6919 0.7797 0.7464 0.9444 1.0000 0.8874 0.8024 0.7818
## 3 0.6333 0.7060 0.5544 0.5320 0.6479 0.6931 0.6759 0.7551 0.8929 0.8619
## 4 0.0881 0.1992 0.0184 0.2261 0.1729 0.2131 0.0693 0.2281 0.4060 0.3973
## 5 0.4152 0.3952 0.4256 0.4135 0.4528 0.5326 0.7306 0.6193 0.2032 0.4636
## 6 0.2988 0.4250 0.6343 0.8198 1.0000 0.9988 0.9508 0.9025 0.7234 0.5122
##   attr21 attr22 attr23 attr24 attr25 attr26 attr27 attr28 attr29 attr30
## 1 0.5783 0.5071 0.4328 0.5550 0.6711 0.6415 0.7104 0.8080 0.6791 0.3857
## 2 0.5212 0.4052 0.3957 0.3914 0.3250 0.3200 0.3271 0.2767 0.4423 0.2028
## 3 0.7974 0.6737 0.4293 0.3648 0.5331 0.2413 0.5070 0.8533 0.6036 0.8514
## 4 0.2741 0.3690 0.5556 0.4846 0.3140 0.5334 0.5256 0.2520 0.2090 0.3559
## 5 0.4148 0.4292 0.5730 0.5399 0.3161 0.2285 0.6995 1.0000 0.7262 0.4724
## 6 0.2074 0.3985 0.5890 0.2872 0.2043 0.5782 0.5389 0.3750 0.3411 0.5067
##   attr31 attr32 attr33 attr34 attr35 attr36 attr37 attr38 attr39 attr40
## 1 0.1307 0.2604 0.5121 0.7547 0.8537 0.8507 0.6692 0.6097 0.4943 0.2744
## 2 0.3788 0.2947 0.1984 0.2341 0.1306 0.4182 0.3835 0.1057 0.1840 0.1970
## 3 0.8512 0.5045 0.1862 0.2709 0.4232 0.3043 0.6116 0.6756 0.5375 0.4719
## 4 0.6260 0.7340 0.6120 0.3497 0.3953 0.3012 0.5408 0.8814 0.9857 0.9167
## 5 0.5103 0.5459 0.2881 0.0981 0.1951 0.4181 0.4604 0.3217 0.2828 0.2430
## 6 0.5580 0.4778 0.3299 0.2198 0.1407 0.2856 0.3807 0.4158 0.4054 0.3296
##   attr41 attr42 attr43 attr44 attr45 attr46 attr47 attr48 attr49 attr50
## 1 0.0510 0.2834 0.2825 0.4256 0.2641 0.1386 0.1051 0.1343 0.0383 0.0324
## 2 0.1674 0.0583 0.1401 0.1628 0.0621 0.0203 0.0530 0.0742 0.0409 0.0061
## 3 0.4647 0.2587 0.2129 0.2222 0.2111 0.0176 0.1348 0.0744 0.0130 0.0106
## 4 0.6121 0.5006 0.3210 0.3202 0.4295 0.3654 0.2655 0.1576 0.0681 0.0294
## 5 0.1979 0.2444 0.1847 0.0841 0.0692 0.0528 0.0357 0.0085 0.0230 0.0046
## 6 0.2707 0.2650 0.0723 0.1238 0.1192 0.1089 0.0623 0.0494 0.0264 0.0081
##   attr51 attr52 attr53 attr54 attr55 attr56 attr57 attr58 attr59 attr60
## 1 0.0232 0.0027 0.0065 0.0159 0.0072 0.0167 0.0180 0.0084 0.0090 0.0032
## 2 0.0125 0.0084 0.0089 0.0048 0.0094 0.0191 0.0140 0.0049 0.0052 0.0044
## 3 0.0033 0.0232 0.0166 0.0095 0.0180 0.0244 0.0316 0.0164 0.0095 0.0078
## 4 0.0241 0.0121 0.0036 0.0150 0.0085 0.0073 0.0050 0.0044 0.0040 0.0117
## 5 0.0156 0.0031 0.0054 0.0105 0.0110 0.0015 0.0072 0.0048 0.0107 0.0094
## 6 0.0104 0.0045 0.0014 0.0038 0.0013 0.0089 0.0057 0.0027 0.0051 0.0062
##   targetVar
## 1         R
## 2         R
## 3         R
## 4         R
## 5         R
## 6         R
sapply(xy_original, class)
##     attr1     attr2     attr3     attr4     attr5     attr6     attr7 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##     attr8     attr9    attr10    attr11    attr12    attr13    attr14 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr15    attr16    attr17    attr18    attr19    attr20    attr21 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr22    attr23    attr24    attr25    attr26    attr27    attr28 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr29    attr30    attr31    attr32    attr33    attr34    attr35 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr36    attr37    attr38    attr39    attr40    attr41    attr42 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr43    attr44    attr45    attr46    attr47    attr48    attr49 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr50    attr51    attr52    attr53    attr54    attr55    attr56 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr57    attr58    attr59    attr60 targetVar 
## "numeric" "numeric" "numeric" "numeric"  "factor"
sapply(xy_original, function(x) sum(is.na(x)))
##     attr1     attr2     attr3     attr4     attr5     attr6     attr7 
##         0         0         0         0         0         0         0 
##     attr8     attr9    attr10    attr11    attr12    attr13    attr14 
##         0         0         0         0         0         0         0 
##    attr15    attr16    attr17    attr18    attr19    attr20    attr21 
##         0         0         0         0         0         0         0 
##    attr22    attr23    attr24    attr25    attr26    attr27    attr28 
##         0         0         0         0         0         0         0 
##    attr29    attr30    attr31    attr32    attr33    attr34    attr35 
##         0         0         0         0         0         0         0 
##    attr36    attr37    attr38    attr39    attr40    attr41    attr42 
##         0         0         0         0         0         0         0 
##    attr43    attr44    attr45    attr46    attr47    attr48    attr49 
##         0         0         0         0         0         0         0 
##    attr50    attr51    attr52    attr53    attr54    attr55    attr56 
##         0         0         0         0         0         0         0 
##    attr57    attr58    attr59    attr60 targetVar 
##         0         0         0         0         0

1.e) Splitting Data into Training and Testing Sets

# Use variable totCol to hold the number of columns in the dataframe
totCol <- ncol(xy_original)

# Set up variable totAttr for the total number of attribute columns
totAttr <- totCol-1
# targetCol variable indicates the column location of the target/class variable
# If the first column, set targetCol to 1. If the last column, set targetCol to totCol
# If targetCol is neither 1 nor totCol, be aware when slicing up the dataframes for visualization!
targetCol <- totCol

# Standardize the class column to the name of targetVar if applicable
#colnames(xy_original)[targetCol] <- "targetVar"
#xy_original$targetVar <- relevel(xy_original$targetVar,"pos")
# We create training datasets (xy_train, x_train, y_train) for various visualization and cleaning/transformation operations.
# We create testing datasets (xy_test, y_test) for various visualization and cleaning/transformation operations.
set.seed(seedNum)

# Create a list of the rows in the original dataset we can use for training
# Use 70% of the data to train the models and the remaining for testing/validation
training_index <- createDataPartition(xy_original$targetVar, p=0.70, list=FALSE)
xy_train <- xy_original[training_index,]
xy_test <- xy_original[-training_index,]

if (targetCol==1) {
  x_train <- xy_train[,(targetCol+1):totCol]
  y_train <- xy_train[,targetCol]
  y_test <- xy_test[,targetCol]
} else {
  x_train <- xy_train[,1:totAttr]
  y_train <- xy_train[,totCol]
  y_test <- xy_test[,totCol]
}
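Since createDataPartition performs a stratified split, a quick check of the class proportions in the training and testing targets should show similar M/R ratios. A minimal sketch of that check using base R, shown here on illustrative factor vectors so it stands alone; in the script you would pass y_train and y_test instead.

```r
# Illustrative target vectors standing in for y_train and y_test
y_train_demo <- factor(c(rep("M", 78), rep("R", 68)))
y_test_demo  <- factor(c(rep("M", 33), rep("R", 29)))

# Compare class proportions between the two splits
round(prop.table(table(y_train_demo)), 3)
round(prop.table(table(y_test_demo)), 3)
```

If the two proportions diverge noticeably, the split (or the seed) is worth revisiting before modeling.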

1.f) Set up the key parameters to be used in the script

# Set up the number of rows and columns for visualization display. dispRow * dispCol should be >= totAttr
dispCol <- 4
if (totAttr%%dispCol == 0) {
  dispRow <- totAttr%/%dispCol
} else {
  dispRow <- (totAttr%/%dispCol) + 1
}
cat("Will attempt to create graphics grid (col x row): ", dispCol, ' by ', dispRow)
## Will attempt to create graphics grid (col x row):  4  by  15
# Run algorithms using 10-fold cross validation
control <- trainControl(method="repeatedcv", number=10, repeats=1)
metricTarget <- "Accuracy"
if (!muteEmail) email_notify(paste("Library and Data Loading completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@1cd072a9}"
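For reference, the repeated 10-fold cross-validation configured by trainControl(method="repeatedcv", number=10, repeats=1) amounts to partitioning the training row indices into ten folds (once, since repeats=1). A base-R sketch of that fold construction, for illustration only; caret builds the folds internally and also stratifies them by class.

```r
set.seed(888)
n <- 146   # number of training rows in this split
k <- 10    # number of folds

# Shuffle the row indices and deal them into k roughly equal folds
folds <- split(sample(seq_len(n)), rep(seq_len(k), length.out = n))

# Every row lands in exactly one fold; fold sizes differ by at most one
lengths(folds)
```

In practice, caret::createFolds offers the same facility directly if explicit fold indices are ever needed.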

2. Summarize Data

To gain a better understanding of the data that we have on-hand, we will leverage a number of descriptive statistics and data visualization techniques. The plan is to use the results to consider new questions, review assumptions, and validate hypotheses that we can investigate later with specialized models.

if (!muteEmail) email_notify(paste("Data Summarization and Visualization has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@6a5fc7f7}"

2.a) Descriptive statistics

2.a.i) Peek at the data itself.

head(xy_train)
##    attr1  attr2  attr3  attr4  attr5  attr6  attr7  attr8  attr9 attr10
## 1 0.0200 0.0371 0.0428 0.0207 0.0954 0.0986 0.1539 0.1601 0.3109 0.2111
## 2 0.0453 0.0523 0.0843 0.0689 0.1183 0.2583 0.2156 0.3481 0.3337 0.2872
## 4 0.0100 0.0171 0.0623 0.0205 0.0205 0.0368 0.1098 0.1276 0.0598 0.1264
## 6 0.0286 0.0453 0.0277 0.0174 0.0384 0.0990 0.1201 0.1833 0.2105 0.3039
## 7 0.0317 0.0956 0.1321 0.1408 0.1674 0.1710 0.0731 0.1401 0.2083 0.3513
## 8 0.0519 0.0548 0.0842 0.0319 0.1158 0.0922 0.1027 0.0613 0.1465 0.2838
##   attr11 attr12 attr13 attr14 attr15 attr16 attr17 attr18 attr19 attr20
## 1 0.1609 0.1582 0.2238 0.0645 0.0660 0.2273 0.3100 0.2999 0.5078 0.4797
## 2 0.4918 0.6552 0.6919 0.7797 0.7464 0.9444 1.0000 0.8874 0.8024 0.7818
## 4 0.0881 0.1992 0.0184 0.2261 0.1729 0.2131 0.0693 0.2281 0.4060 0.3973
## 6 0.2988 0.4250 0.6343 0.8198 1.0000 0.9988 0.9508 0.9025 0.7234 0.5122
## 7 0.1786 0.0658 0.0513 0.3752 0.5419 0.5440 0.5150 0.4262 0.2024 0.4233
## 8 0.2802 0.3086 0.2657 0.3801 0.5626 0.4376 0.2617 0.1199 0.6676 0.9402
##   attr21 attr22 attr23 attr24 attr25 attr26 attr27 attr28 attr29 attr30
## 1 0.5783 0.5071 0.4328 0.5550 0.6711 0.6415 0.7104 0.8080 0.6791 0.3857
## 2 0.5212 0.4052 0.3957 0.3914 0.3250 0.3200 0.3271 0.2767 0.4423 0.2028
## 4 0.2741 0.3690 0.5556 0.4846 0.3140 0.5334 0.5256 0.2520 0.2090 0.3559
## 6 0.2074 0.3985 0.5890 0.2872 0.2043 0.5782 0.5389 0.3750 0.3411 0.5067
## 7 0.7723 0.9735 0.9390 0.5559 0.5268 0.6826 0.5713 0.5429 0.2177 0.2149
## 8 0.7832 0.5352 0.6809 0.9174 0.7613 0.8220 0.8872 0.6091 0.2967 0.1103
##   attr31 attr32 attr33 attr34 attr35 attr36 attr37 attr38 attr39 attr40
## 1 0.1307 0.2604 0.5121 0.7547 0.8537 0.8507 0.6692 0.6097 0.4943 0.2744
## 2 0.3788 0.2947 0.1984 0.2341 0.1306 0.4182 0.3835 0.1057 0.1840 0.1970
## 4 0.6260 0.7340 0.6120 0.3497 0.3953 0.3012 0.5408 0.8814 0.9857 0.9167
## 6 0.5580 0.4778 0.3299 0.2198 0.1407 0.2856 0.3807 0.4158 0.4054 0.3296
## 7 0.5811 0.6323 0.2965 0.1873 0.2969 0.5163 0.6153 0.4283 0.5479 0.6133
## 8 0.1318 0.0624 0.0990 0.4006 0.3666 0.1050 0.1915 0.3930 0.4288 0.2546
##   attr41 attr42 attr43 attr44 attr45 attr46 attr47 attr48 attr49 attr50
## 1 0.0510 0.2834 0.2825 0.4256 0.2641 0.1386 0.1051 0.1343 0.0383 0.0324
## 2 0.1674 0.0583 0.1401 0.1628 0.0621 0.0203 0.0530 0.0742 0.0409 0.0061
## 4 0.6121 0.5006 0.3210 0.3202 0.4295 0.3654 0.2655 0.1576 0.0681 0.0294
## 6 0.2707 0.2650 0.0723 0.1238 0.1192 0.1089 0.0623 0.0494 0.0264 0.0081
## 7 0.5017 0.2377 0.1957 0.1749 0.1304 0.0597 0.1124 0.1047 0.0507 0.0159
## 8 0.1151 0.2196 0.1879 0.1437 0.2146 0.2360 0.1125 0.0254 0.0285 0.0178
##   attr51 attr52 attr53 attr54 attr55 attr56 attr57 attr58 attr59 attr60
## 1 0.0232 0.0027 0.0065 0.0159 0.0072 0.0167 0.0180 0.0084 0.0090 0.0032
## 2 0.0125 0.0084 0.0089 0.0048 0.0094 0.0191 0.0140 0.0049 0.0052 0.0044
## 4 0.0241 0.0121 0.0036 0.0150 0.0085 0.0073 0.0050 0.0044 0.0040 0.0117
## 6 0.0104 0.0045 0.0014 0.0038 0.0013 0.0089 0.0057 0.0027 0.0051 0.0062
## 7 0.0195 0.0201 0.0248 0.0131 0.0070 0.0138 0.0092 0.0143 0.0036 0.0103
## 8 0.0052 0.0081 0.0120 0.0045 0.0121 0.0097 0.0085 0.0047 0.0048 0.0053
##   targetVar
## 1         R
## 2         R
## 4         R
## 6         R
## 7         R
## 8         R

2.a.ii) Dimensions of the dataset.

dim(xy_train)
## [1] 146  61

2.a.iii) Types of the attributes.

sapply(xy_train, class)
##     attr1     attr2     attr3     attr4     attr5     attr6     attr7 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##     attr8     attr9    attr10    attr11    attr12    attr13    attr14 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr15    attr16    attr17    attr18    attr19    attr20    attr21 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr22    attr23    attr24    attr25    attr26    attr27    attr28 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr29    attr30    attr31    attr32    attr33    attr34    attr35 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr36    attr37    attr38    attr39    attr40    attr41    attr42 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr43    attr44    attr45    attr46    attr47    attr48    attr49 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr50    attr51    attr52    attr53    attr54    attr55    attr56 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr57    attr58    attr59    attr60 targetVar 
## "numeric" "numeric" "numeric" "numeric"  "factor"

2.a.iv) Statistical summary of all attributes.

summary(xy_train)
##      attr1             attr2             attr3             attr4        
##  Min.   :0.00150   Min.   :0.00220   Min.   :0.00240   Min.   :0.00580  
##  1st Qu.:0.01383   1st Qu.:0.01673   1st Qu.:0.01887   1st Qu.:0.02455  
##  Median :0.02300   Median :0.03165   Median :0.03470   Median :0.04405  
##  Mean   :0.02938   Mean   :0.03922   Mean   :0.04527   Mean   :0.05499  
##  3rd Qu.:0.03658   3rd Qu.:0.04735   3rd Qu.:0.06028   3rd Qu.:0.06270  
##  Max.   :0.13130   Max.   :0.23390   Max.   :0.30590   Max.   :0.42640  
##      attr5             attr6             attr7            attr8        
##  Min.   :0.00670   Min.   :0.01020   Min.   :0.0033   Min.   :0.00550  
##  1st Qu.:0.03572   1st Qu.:0.06797   1st Qu.:0.0860   1st Qu.:0.07855  
##  Median :0.06250   Median :0.09215   Median :0.1102   Median :0.11265  
##  Mean   :0.07441   Mean   :0.10360   Mean   :0.1209   Mean   :0.13236  
##  3rd Qu.:0.09945   3rd Qu.:0.14005   3rd Qu.:0.1526   3rd Qu.:0.16900  
##  Max.   :0.40100   Max.   :0.27700   Max.   :0.3016   Max.   :0.45660  
##      attr9            attr10           attr11           attr12      
##  Min.   :0.0117   Min.   :0.0113   Min.   :0.0289   Min.   :0.0236  
##  1st Qu.:0.0907   1st Qu.:0.1110   1st Qu.:0.1329   1st Qu.:0.1383  
##  Median :0.1466   Median :0.1842   Median :0.2248   Median :0.2478  
##  Mean   :0.1731   Mean   :0.2033   Mean   :0.2326   Mean   :0.2473  
##  3rd Qu.:0.2347   3rd Qu.:0.2712   3rd Qu.:0.2984   3rd Qu.:0.3297  
##  Max.   :0.6828   Max.   :0.7106   Max.   :0.7342   Max.   :0.6552  
##      attr13           attr14           attr15           attr16      
##  Min.   :0.0184   Min.   :0.0273   Min.   :0.0031   Min.   :0.0162  
##  1st Qu.:0.1754   1st Qu.:0.1860   1st Qu.:0.1802   1st Qu.:0.2051  
##  Median :0.2516   Median :0.2878   Median :0.3010   Median :0.3393  
##  Mean   :0.2748   Mean   :0.3037   Mean   :0.3346   Mean   :0.3945  
##  3rd Qu.:0.3615   3rd Qu.:0.3948   3rd Qu.:0.4937   3rd Qu.:0.5433  
##  Max.   :0.7022   Max.   :0.9970   Max.   :1.0000   Max.   :0.9988  
##      attr17           attr18           attr19           attr20      
##  Min.   :0.0349   Min.   :0.0689   Min.   :0.0494   Min.   :0.0740  
##  1st Qu.:0.2164   1st Qu.:0.2449   1st Qu.:0.3124   1st Qu.:0.3508  
##  Median :0.3508   Median :0.3803   Median :0.4480   Median :0.5443  
##  Mean   :0.4357   Mean   :0.4698   Mean   :0.5168   Mean   :0.5672  
##  3rd Qu.:0.6827   3rd Qu.:0.6913   3rd Qu.:0.7426   3rd Qu.:0.8179  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      attr21           attr22           attr23           attr24      
##  Min.   :0.0512   Min.   :0.0689   Min.   :0.0563   Min.   :0.0239  
##  1st Qu.:0.4035   1st Qu.:0.4295   1st Qu.:0.4600   1st Qu.:0.5307  
##  Median :0.6479   Median :0.6836   Median :0.7015   Median :0.6944  
##  Mean   :0.6127   Mean   :0.6379   Mean   :0.6570   Mean   :0.6748  
##  3rd Qu.:0.8358   3rd Qu.:0.8537   3rd Qu.:0.8635   3rd Qu.:0.8748  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      attr25           attr26           attr27           attr28      
##  Min.   :0.0395   Min.   :0.0921   Min.   :0.0481   Min.   :0.0284  
##  1st Qu.:0.5340   1st Qu.:0.5486   1st Qu.:0.5362   1st Qu.:0.5005  
##  Median :0.7061   Median :0.7501   Median :0.7462   Median :0.7124  
##  Mean   :0.6723   Mean   :0.7006   Mean   :0.7015   Mean   :0.6731  
##  3rd Qu.:0.8641   3rd Qu.:0.8904   3rd Qu.:0.9143   3rd Qu.:0.8703  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      attr29           attr30           attr31           attr32      
##  Min.   :0.0144   Min.   :0.0613   Min.   :0.0482   Min.   :0.0404  
##  1st Qu.:0.4426   1st Qu.:0.3825   1st Qu.:0.3043   1st Qu.:0.2766  
##  Median :0.6296   Median :0.5657   Median :0.4487   Median :0.3986  
##  Mean   :0.6186   Mean   :0.5614   Mean   :0.4856   Mean   :0.4267  
##  3rd Qu.:0.8443   3rd Qu.:0.7310   3rd Qu.:0.6284   3rd Qu.:0.5607  
##  Max.   :1.0000   Max.   :1.0000   Max.   :0.9657   Max.   :0.9306  
##      attr33           attr34           attr35           attr36      
##  Min.   :0.0477   Min.   :0.0212   Min.   :0.0223   Min.   :0.0271  
##  1st Qu.:0.2554   1st Qu.:0.2097   1st Qu.:0.1737   1st Qu.:0.1550  
##  Median :0.3897   Median :0.3372   Median :0.3042   Median :0.3184  
##  Mean   :0.4076   Mean   :0.3909   Mean   :0.3874   Mean   :0.3901  
##  3rd Qu.:0.5381   3rd Qu.:0.5903   3rd Qu.:0.6110   3rd Qu.:0.5825  
##  Max.   :0.9708   Max.   :0.9647   Max.   :1.0000   Max.   :1.0000  
##      attr37           attr38           attr39           attr40      
##  Min.   :0.0351   Min.   :0.0383   Min.   :0.0371   Min.   :0.0117  
##  1st Qu.:0.1673   1st Qu.:0.1722   1st Qu.:0.1671   1st Qu.:0.1782  
##  Median :0.3283   Median :0.2940   Median :0.2782   Median :0.2791  
##  Mean   :0.3713   Mean   :0.3323   Mean   :0.3151   Mean   :0.3094  
##  3rd Qu.:0.5402   3rd Qu.:0.4423   3rd Qu.:0.4277   3rd Qu.:0.4299  
##  Max.   :0.9497   Max.   :1.0000   Max.   :0.9857   Max.   :0.9167  
##      attr41           attr42           attr43           attr44      
##  Min.   :0.0360   Min.   :0.0056   Min.   :0.0159   Min.   :0.0255  
##  1st Qu.:0.1604   1st Qu.:0.1555   1st Qu.:0.1578   1st Qu.:0.1279  
##  Median :0.2661   Median :0.2415   Median :0.2364   Median :0.1764  
##  Mean   :0.2854   Mean   :0.2750   Mean   :0.2471   Mean   :0.2141  
##  3rd Qu.:0.3939   3rd Qu.:0.3856   3rd Qu.:0.3197   3rd Qu.:0.2685  
##  Max.   :0.7322   Max.   :0.8246   Max.   :0.7517   Max.   :0.5772  
##      attr45           attr46            attr47            attr48       
##  Min.   :0.0095   Min.   :0.00250   Min.   :0.00730   Min.   :0.00410  
##  1st Qu.:0.1074   1st Qu.:0.06897   1st Qu.:0.06283   1st Qu.:0.04537  
##  Median :0.1489   Median :0.12525   Median :0.10550   Median :0.07860  
##  Mean   :0.2003   Mean   :0.16084   Mean   :0.12130   Mean   :0.09138  
##  3rd Qu.:0.2369   3rd Qu.:0.20413   3rd Qu.:0.15345   3rd Qu.:0.11925  
##  Max.   :0.7034   Max.   :0.72920   Max.   :0.55220   Max.   :0.33390  
##      attr49            attr50            attr51             attr52        
##  Min.   :0.00210   Min.   :0.00060   Min.   :0.000900   Min.   :0.001300  
##  1st Qu.:0.02660   1st Qu.:0.01202   1st Qu.:0.009125   1st Qu.:0.007725  
##  Median :0.04515   Median :0.01840   Median :0.014750   Median :0.011250  
##  Mean   :0.05242   Mean   :0.02073   Mean   :0.016533   Mean   :0.013642  
##  3rd Qu.:0.07130   3rd Qu.:0.02602   3rd Qu.:0.021475   3rd Qu.:0.016375  
##  Max.   :0.16080   Max.   :0.06370   Max.   :0.100400   Max.   :0.070900  
##      attr53            attr54             attr55        
##  Min.   :0.00050   Min.   :0.001800   Min.   :0.001200  
##  1st Qu.:0.00450   1st Qu.:0.006025   1st Qu.:0.003925  
##  Median :0.00935   Median :0.009300   Median :0.007500  
##  Mean   :0.01035   Mean   :0.011103   Mean   :0.009260  
##  3rd Qu.:0.01468   3rd Qu.:0.014500   3rd Qu.:0.012325  
##  Max.   :0.03170   Max.   :0.035200   Max.   :0.037200  
##      attr56             attr57            attr58        
##  Min.   :0.000600   Min.   :0.00030   Min.   :0.000600  
##  1st Qu.:0.004800   1st Qu.:0.00405   1st Qu.:0.003900  
##  Median :0.007450   Median :0.00655   Median :0.006350  
##  Mean   :0.008347   Mean   :0.00798   Mean   :0.008359  
##  3rd Qu.:0.011075   3rd Qu.:0.01050   3rd Qu.:0.011200  
##  Max.   :0.032600   Max.   :0.02580   Max.   :0.037700  
##      attr59             attr60         targetVar
##  Min.   :0.000200   Min.   :0.000600   M:78     
##  1st Qu.:0.004300   1st Qu.:0.003225   R:68     
##  Median :0.006400   Median :0.005350            
##  Mean   :0.007888   Mean   :0.006714            
##  3rd Qu.:0.010475   3rd Qu.:0.008775            
##  Max.   :0.033200   Max.   :0.043900

2.a.v) Count missing values.

sapply(xy_train, function(x) sum(is.na(x)))
##     attr1     attr2     attr3     attr4     attr5     attr6     attr7 
##         0         0         0         0         0         0         0 
##     attr8     attr9    attr10    attr11    attr12    attr13    attr14 
##         0         0         0         0         0         0         0 
##    attr15    attr16    attr17    attr18    attr19    attr20    attr21 
##         0         0         0         0         0         0         0 
##    attr22    attr23    attr24    attr25    attr26    attr27    attr28 
##         0         0         0         0         0         0         0 
##    attr29    attr30    attr31    attr32    attr33    attr34    attr35 
##         0         0         0         0         0         0         0 
##    attr36    attr37    attr38    attr39    attr40    attr41    attr42 
##         0         0         0         0         0         0         0 
##    attr43    attr44    attr45    attr46    attr47    attr48    attr49 
##         0         0         0         0         0         0         0 
##    attr50    attr51    attr52    attr53    attr54    attr55    attr56 
##         0         0         0         0         0         0         0 
##    attr57    attr58    attr59    attr60 targetVar 
##         0         0         0         0         0

2.a.vi) Summarize the levels of the class attribute.

cbind(freq=table(y_train), percentage=prop.table(table(y_train))*100)
##   freq percentage
## M   78   53.42466
## R   68   46.57534

2.b) Data visualizations

2.b.i) Univariate plots to better understand each attribute.

# Boxplots for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
    boxplot(x_train[,i], main=names(x_train)[i])
}

# Histograms for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
    hist(x_train[,i], main=names(x_train)[i])
}

# Density plot for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
    plot(density(x_train[,i]), main=names(x_train)[i])
}
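
The commented-out `par(mfrow=c(dispRow,dispCol))` calls above assume grid dimensions `dispRow` and `dispCol` were set earlier in the script. A minimal sketch of how such a panel layout might be derived from the attribute count (the variable values here are illustrative assumptions, not taken from this run):

```r
# Hypothetical panel layout for the 60 attributes, e.g. 6 rows x 10 columns
totAttr <- 60
dispCol <- 10
dispRow <- ceiling(totAttr / dispCol)
par(mfrow=c(dispRow, dispCol), mar=c(2, 2, 2, 1))  # shrink margins so all panels fit
for(i in 1:totAttr) {
    boxplot(x_train[,i], main=names(x_train)[i])
}
par(mfrow=c(1, 1))  # restore the default single-panel layout
```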

2.b.ii) Multivariate plots to better understand the relationships between attributes

# Scatterplot matrix colored by class
#pairs(targetVar~., data=xy_train, col=xy_train$targetVar)
# Box and whisker plots for each attribute by class
#scales <- list(x=list(relation="free"), y=list(relation="free"))
#featurePlot(x=x_train, y=y_train, plot="box", scales=scales)
# Density plots for each attribute by class value
#featurePlot(x=x_train, y=y_train, plot="density", scales=scales)
# Correlation plot
correlations <- cor(x_train)
corrplot(correlations, method="circle")

if (!muteEmail) email_notify(paste("Data Summarization and Visualization completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@511baa65}"

3. Prepare Data

Some datasets may require additional preparation activities that best expose the structure of the problem and the relationships between the input attributes and the output variable. For this template, the data-prep tasks include feature selection and data transforms.

if (!muteEmail) email_notify(paste("Data Cleaning and Transformation has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@6fadae5d}"

3.a) Feature Selection

# Not applicable for this iteration of the project
# Using the correlations calculated previously, we try to find attributes that are highly correlated.
# highlyCorrelated <- findCorrelation(correlations, cutoff=0.85)
# print(highlyCorrelated)
# cat('Number of attributes found to be highly correlated:',length(highlyCorrelated))

# Removing the highly correlated attributes from the training and validation dataframes
# xy_train <- xy_train[, -highlyCorrelated]
# xy_test <- xy_test[, -highlyCorrelated]
# Not applicable for this iteration of the project
# Sample code for performing feature selection by ranking the attributes' importance.
# startTimeModule <- proc.time()
# set.seed(seedNum)
# library(gbm)
# model_fs <- train(targetVar~., data=xy_train, method="gbm", preProcess="scale", trControl=control, verbose=F)
# rankedImportance <- varImp(model_fs, scale=FALSE)
# print(rankedImportance)
# plot(rankedImportance)

# Set the importance threshold and calculate the list of attributes that don't contribute to the importance threshold
# maxThreshold <- 0.99
# rankedAttributes <- rankedImportance$importance
# rankedAttributes <- rankedAttributes[order(-rankedAttributes$Overall),,drop=FALSE]
# totalWeight <- sum(rankedAttributes)
# i <- 1
# accumWeight <- 0
# exit_now <- FALSE
# while ((i <= totAttr) & !exit_now) {
#   accumWeight = accumWeight + rankedAttributes[i,]
#   if ((accumWeight/totalWeight) >= maxThreshold) {
#     exit_now <- TRUE
#   } else {
#     i <- i + 1
#   }
# }
# lowImportance <- rankedAttributes[(i+1):(totAttr),,drop=FALSE]
# lowAttributes <- rownames(lowImportance)
# cat('Number of attributes contributed to the importance threshold:',i,"\n")
# cat('Number of attributes found to be of low importance:',length(lowAttributes))

# Removing the unselected attributes from the training and validation dataframes
# xy_train <- xy_train[, !(names(xy_train) %in% lowAttributes)]
# xy_test <- xy_test[, !(names(xy_test) %in% lowAttributes)]
# Not applicable for this iteration of the project
# Sample code for performing feature selection using the Recursive Feature Elimination (RFE) technique
# startTimeModule <- proc.time()
# set.seed(seedNum)
# rfeCTRL <- rfeControl(functions=rfFuncs, method="cv", number=10)
# rfeResults <- rfe(xy_train[,1:totAttr], xy_train[,totCol], sizes=c(30:55), rfeControl=rfeCTRL)
# print(rfeResults)
# rfeAttributes <- predictors(rfeResults)
# cat('Number of attributes identified from the RFE algorithm:',length(rfeAttributes))
# print(rfeAttributes)
# plot(rfeResults, type=c("g", "o"))

# Removing the unselected attributes from the training and validation dataframes
# rfeAttributes <- c(rfeAttributes,"targetVar")
# xy_train <- xy_train[, (names(xy_train) %in% rfeAttributes)]
# xy_test <- xy_test[, (names(xy_test) %in% rfeAttributes)]

3.b) Data Transforms

# Not applicable for this iteration of the project
# Sample code for performing SMOTE transformation to combat the unbalanced data
# set.seed(seedNum)
# xy_train <- SMOTE(targetVar ~., data=xy_train, perc.over=200, perc.under=300)
# totCol <- ncol(xy_train)
# y_train <- xy_train[,totCol]
# cbind(freq=table(y_train), percentage=prop.table(table(y_train))*100)
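
Although no transforms were applied in this iteration, another common data-prep step for attribute sets like this one is centering and scaling via caret's `preProcess`. A hedged sketch, not part of this run; the transform is fit on the training attributes only so the validation split does not influence the preparation:

```r
library(caret)
# Learn the centering/scaling parameters from the training attributes,
# then apply the same transform to both splits
prep <- preProcess(x_train, method=c("center", "scale"))
x_train_scaled <- predict(prep, x_train)
x_test_scaled  <- predict(prep, xy_test[, names(x_train)])
```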

3.c) Display the Final Dataset for Model-Building

dim(xy_train)
## [1] 146  61
dim(xy_test)
## [1] 62 61
sapply(xy_train, class)
##     attr1     attr2     attr3     attr4     attr5     attr6     attr7 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##     attr8     attr9    attr10    attr11    attr12    attr13    attr14 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr15    attr16    attr17    attr18    attr19    attr20    attr21 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr22    attr23    attr24    attr25    attr26    attr27    attr28 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr29    attr30    attr31    attr32    attr33    attr34    attr35 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr36    attr37    attr38    attr39    attr40    attr41    attr42 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr43    attr44    attr45    attr46    attr47    attr48    attr49 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr50    attr51    attr52    attr53    attr54    attr55    attr56 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr57    attr58    attr59    attr60 targetVar 
## "numeric" "numeric" "numeric" "numeric"  "factor"
if (!muteEmail) email_notify(paste("Data Cleaning and Transformation completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@35f983a6}"
proc.time()-startTimeScript
##    user  system elapsed 
##  26.540   0.360  34.798

4. Model and Evaluate Algorithms

After the data preparation, we next work on finding a workable model by evaluating a subset of machine learning algorithms that are good at exploiting the structure of the training data. The typical evaluation tasks include defining the test options, spot-checking a suite of algorithms, and comparing their estimated accuracy.

For this project, we will evaluate one linear, one non-linear, and three ensemble algorithms:

Linear Algorithm: Logistic Regression

Non-Linear Algorithm: Decision Trees (CART)

Ensemble Algorithms: Bagged CART, Random Forest, and Gradient Boosting

The random number seed is reset before each run so that every algorithm is evaluated using the same data splits, which makes the results directly comparable.
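
The `control`, `metricTarget`, and `seedNum` objects used in the calls below were defined earlier in the script. A sketch of what that setup likely looks like, with values inferred from the printed resampling output (10-fold cross-validation repeated once, accuracy as the selection metric) rather than confirmed by this chunk:

```r
library(caret)
seedNum <- 888           # assumed value; any fixed integer gives reproducible splits
metricTarget <- "Accuracy"
# 10-fold cross-validation, repeated 1 time, matching the resampling
# summary printed with each model below
control <- trainControl(method="repeatedcv", number=10, repeats=1)
```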

4.a) Generate models using linear algorithms

# Logistic Regression (Classification)
if (!muteEmail) email_notify(paste("Logistic Regression modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@3498ed}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.glm <- train(targetVar~., data=xy_train, method="glm", metric=metricTarget, trControl=control)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
print(fit.glm)
## Generalized Linear Model 
## 
## 146 samples
##  60 predictor
##   2 classes: 'M', 'R' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 131, 131, 131, 131, 132, 132, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7390476  0.4767065
proc.time()-startTimeModule
##    user  system elapsed 
##   0.820   0.000   0.812
if (!muteEmail) email_notify(paste("Logistic Regression modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@38082d64}"

4.b) Generate models using nonlinear algorithms

# Decision Tree - CART (Regression/Classification)
if (!muteEmail) email_notify(paste("Decision Tree modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@180bc464}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.cart <- train(targetVar~., data=xy_train, method="rpart", metric=metricTarget, trControl=control)
print(fit.cart)
## CART 
## 
## 146 samples
##  60 predictor
##   2 classes: 'M', 'R' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 131, 131, 131, 131, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.07352941  0.6847619  0.3640170
##   0.08823529  0.6776190  0.3492613
##   0.50000000  0.5833333  0.1200159
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.07352941.
proc.time()-startTimeModule
##    user  system elapsed 
##   0.900   0.000   0.899
if (!muteEmail) email_notify(paste("Decision Tree modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@6fb554cc}"

4.c) Generate models using ensemble algorithms

In this section, we will explore the use and tuning of ensemble algorithms to see whether we can improve the results.

# Bagged CART (Regression/Classification)
if (!muteEmail) email_notify(paste("Bagged CART modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@1936f0f5}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.bagcart <- train(targetVar~., data=xy_train, method="treebag", metric=metricTarget, trControl=control)
print(fit.bagcart)
## Bagged CART 
## 
## 146 samples
##  60 predictor
##   2 classes: 'M', 'R' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 131, 131, 131, 131, 132, 132, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7680952  0.5299798
proc.time()-startTimeModule
##    user  system elapsed 
##   3.740   0.010   3.742
if (!muteEmail) email_notify(paste("Bagged CART modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@35fb3008}"
# Random Forest (Regression/Classification)
if (!muteEmail) email_notify(paste("Random Forest modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@737996a0}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.rf <- train(targetVar~., data=xy_train, method="rf", metric=metricTarget, trControl=control)
print(fit.rf)
## Random Forest 
## 
## 146 samples
##  60 predictor
##   2 classes: 'M', 'R' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 131, 131, 131, 131, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8357143  0.6634035
##   31    0.7747619  0.5451204
##   60    0.7880952  0.5716670
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
proc.time()-startTimeModule
##    user  system elapsed 
##   6.300   0.010   6.314
if (!muteEmail) email_notify(paste("Random Forest modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@13a57a3b}"
# Gradient Boosting (Regression/Classification)
if (!muteEmail) email_notify(paste("Gradient Boosting modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2669b199}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.gbm <- train(targetVar~., data=xy_train, method="xgbTree", metric=metricTarget, trControl=control, verbose=F)
# fit.gbm <- train(targetVar~., data=xy_train, method="gbm", metric=metricTarget, trControl=control, verbose=F)
print(fit.gbm)
## eXtreme Gradient Boosting 
## 
## 146 samples
##  60 predictor
##   2 classes: 'M', 'R' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 131, 131, 131, 131, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   eta  max_depth  colsample_bytree  subsample  nrounds  Accuracy 
##   0.3  1          0.6               0.50        50      0.8152381
##   0.3  1          0.6               0.50       100      0.8428571
##   0.3  1          0.6               0.50       150      0.8233333
##   0.3  1          0.6               0.75        50      0.8090476
##   0.3  1          0.6               0.75       100      0.7957143
##   0.3  1          0.6               0.75       150      0.8152381
##   0.3  1          0.6               1.00        50      0.8076190
##   0.3  1          0.6               1.00       100      0.8485714
##   0.3  1          0.6               1.00       150      0.8423810
##   0.3  1          0.8               0.50        50      0.8419048
##   0.3  1          0.8               0.50       100      0.8490476
##   0.3  1          0.8               0.50       150      0.8561905
##   0.3  1          0.8               0.75        50      0.8090476
##   0.3  1          0.8               0.75       100      0.8085714
##   0.3  1          0.8               0.75       150      0.8361905
##   0.3  1          0.8               1.00        50      0.8152381
##   0.3  1          0.8               1.00       100      0.8561905
##   0.3  1          0.8               1.00       150      0.8428571
##   0.3  2          0.6               0.50        50      0.8014286
##   0.3  2          0.6               0.50       100      0.8152381
##   0.3  2          0.6               0.50       150      0.8366667
##   0.3  2          0.6               0.75        50      0.8085714
##   0.3  2          0.6               0.75       100      0.8223810
##   0.3  2          0.6               0.75       150      0.8366667
##   0.3  2          0.6               1.00        50      0.8352381
##   0.3  2          0.6               1.00       100      0.8633333
##   0.3  2          0.6               1.00       150      0.8495238
##   0.3  2          0.8               0.50        50      0.8766667
##   0.3  2          0.8               0.50       100      0.8352381
##   0.3  2          0.8               0.50       150      0.8285714
##   0.3  2          0.8               0.75        50      0.8357143
##   0.3  2          0.8               0.75       100      0.8223810
##   0.3  2          0.8               0.75       150      0.8223810
##   0.3  2          0.8               1.00        50      0.8490476
##   0.3  2          0.8               1.00       100      0.8495238
##   0.3  2          0.8               1.00       150      0.8428571
##   0.3  3          0.6               0.50        50      0.8504762
##   0.3  3          0.6               0.50       100      0.8295238
##   0.3  3          0.6               0.50       150      0.8366667
##   0.3  3          0.6               0.75        50      0.8695238
##   0.3  3          0.6               0.75       100      0.8704762
##   0.3  3          0.6               0.75       150      0.8704762
##   0.3  3          0.6               1.00        50      0.8157143
##   0.3  3          0.6               1.00       100      0.8290476
##   0.3  3          0.6               1.00       150      0.8290476
##   0.3  3          0.8               0.50        50      0.8361905
##   0.3  3          0.8               0.50       100      0.8428571
##   0.3  3          0.8               0.50       150      0.8495238
##   0.3  3          0.8               0.75        50      0.8366667
##   0.3  3          0.8               0.75       100      0.8571429
##   0.3  3          0.8               0.75       150      0.8504762
##   0.3  3          0.8               1.00        50      0.8423810
##   0.3  3          0.8               1.00       100      0.8490476
##   0.3  3          0.8               1.00       150      0.8423810
##   0.4  1          0.6               0.50        50      0.8014286
##   0.4  1          0.6               0.50       100      0.8023810
##   0.4  1          0.6               0.50       150      0.8023810
##   0.4  1          0.6               0.75        50      0.8219048
##   0.4  1          0.6               0.75       100      0.8561905
##   0.4  1          0.6               0.75       150      0.8704762
##   0.4  1          0.6               1.00        50      0.8290476
##   0.4  1          0.6               1.00       100      0.8490476
##   0.4  1          0.6               1.00       150      0.8357143
##   0.4  1          0.8               0.50        50      0.8280952
##   0.4  1          0.8               0.50       100      0.8290476
##   0.4  1          0.8               0.50       150      0.8423810
##   0.4  1          0.8               0.75        50      0.8285714
##   0.4  1          0.8               0.75       100      0.8285714
##   0.4  1          0.8               0.75       150      0.8147619
##   0.4  1          0.8               1.00        50      0.8357143
##   0.4  1          0.8               1.00       100      0.8357143
##   0.4  1          0.8               1.00       150      0.8357143
##   0.4  2          0.6               0.50        50      0.8495238
##   0.4  2          0.6               0.50       100      0.8628571
##   0.4  2          0.6               0.50       150      0.8428571
##   0.4  2          0.6               0.75        50      0.8357143
##   0.4  2          0.6               0.75       100      0.8428571
##   0.4  2          0.6               0.75       150      0.8423810
##   0.4  2          0.6               1.00        50      0.8347619
##   0.4  2          0.6               1.00       100      0.8285714
##   0.4  2          0.6               1.00       150      0.8285714
##   0.4  2          0.8               0.50        50      0.8347619
##   0.4  2          0.8               0.50       100      0.8485714
##   0.4  2          0.8               0.50       150      0.8352381
##   0.4  2          0.8               0.75        50      0.8552381
##   0.4  2          0.8               0.75       100      0.8490476
##   0.4  2          0.8               0.75       150      0.8485714
##   0.4  2          0.8               1.00        50      0.8357143
##   0.4  2          0.8               1.00       100      0.8566667
##   0.4  2          0.8               1.00       150      0.8566667
##   0.4  3          0.6               0.50        50      0.8352381
##   0.4  3          0.6               0.50       100      0.8352381
##   0.4  3          0.6               0.50       150      0.8347619
##   0.4  3          0.6               0.75        50      0.8419048
##   0.4  3          0.6               0.75       100      0.8490476
##   0.4  3          0.6               0.75       150      0.8490476
##   0.4  3          0.6               1.00        50      0.8209524
##   0.4  3          0.6               1.00       100      0.8490476
##   0.4  3          0.6               1.00       150      0.8490476
##   0.4  3          0.8               0.50        50      0.8161905
##   0.4  3          0.8               0.50       100      0.8228571
##   0.4  3          0.8               0.50       150      0.8228571
##   0.4  3          0.8               0.75        50      0.8490476
##   0.4  3          0.8               0.75       100      0.8495238
##   0.4  3          0.8               0.75       150      0.8423810
##   0.4  3          0.8               1.00        50      0.8495238
##   0.4  3          0.8               1.00       100      0.8704762
##   0.4  3          0.8               1.00       150      0.8704762
##   Kappa    
##   0.6269515
##   0.6800768
##   0.6412784
##   0.6137278
##   0.5867604
##   0.6257157
##   0.6131517
##   0.6959970
##   0.6817465
##   0.6816369
##   0.6944030
##   0.7086888
##   0.6131820
##   0.6129151
##   0.6690120
##   0.6291544
##   0.7104835
##   0.6827579
##   0.6000520
##   0.6266617
##   0.6695189
##   0.6121573
##   0.6395144
##   0.6674526
##   0.6676661
##   0.7235815
##   0.6960347
##   0.7485045
##   0.6651003
##   0.6523107
##   0.6655254
##   0.6392402
##   0.6392402
##   0.6947622
##   0.6955366
##   0.6817817
##   0.6952238
##   0.6531452
##   0.6674309
##   0.7314825
##   0.7354981
##   0.7348902
##   0.6262422
##   0.6532780
##   0.6532780
##   0.6687733
##   0.6807599
##   0.6937972
##   0.6672048
##   0.7092217
##   0.6954625
##   0.6809859
##   0.6952478
##   0.6815077
##   0.6006864
##   0.6004031
##   0.6013369
##   0.6389682
##   0.7080999
##   0.7371583
##   0.6545138
##   0.6949743
##   0.6674754
##   0.6535142
##   0.6549530
##   0.6810870
##   0.6517208
##   0.6537727
##   0.6264717
##   0.6686793
##   0.6677145
##   0.6689379
##   0.6975957
##   0.7248493
##   0.6840757
##   0.6669667
##   0.6810134
##   0.6797586
##   0.6671311
##   0.6538916
##   0.6538916
##   0.6657127
##   0.6947822
##   0.6684791
##   0.7067686
##   0.6936395
##   0.6928610
##   0.6684227
##   0.7089366
##   0.7089366
##   0.6708530
##   0.6697102
##   0.6684659
##   0.6802525
##   0.6941618
##   0.6941618
##   0.6381358
##   0.6937986
##   0.6937986
##   0.6258671
##   0.6387655
##   0.6390223
##   0.6921711
##   0.6924451
##   0.6781594
##   0.6940700
##   0.7362397
##   0.7362397
## 
## Tuning parameter 'gamma' was held constant at a value of 0
## 
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 50, max_depth = 2,
##  eta = 0.3, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1
##  and subsample = 0.5.
proc.time()-startTimeModule
##    user  system elapsed 
##  36.270   0.760  19.699
if (!muteEmail) email_notify(paste("Gradient Boosting modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@23ceabc1}"

4.d) Compare baseline algorithms

results <- resamples(list(LR=fit.glm, CART=fit.cart, BagCART=fit.bagcart, RF=fit.rf, GBM=fit.gbm))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: LR, CART, BagCART, RF, GBM 
## Number of resamples: 10 
## 
## Accuracy 
##              Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## LR      0.5333333 0.6607143 0.7666667 0.7390476 0.8000000 0.9285714    0
## CART    0.4666667 0.5892857 0.7238095 0.6847619 0.7726190 0.8666667    0
## BagCART 0.4000000 0.6785714 0.8000000 0.7680952 0.9107143 0.9333333    0
## RF      0.6000000 0.7892857 0.8333333 0.8357143 0.9321429 1.0000000    0
## GBM     0.6666667 0.7857143 0.9309524 0.8766667 0.9833333 1.0000000    0
## 
## Kappa 
##                Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## LR       0.08695652 0.3331202 0.5257208 0.4767065 0.5999761 0.8571429    0
## CART    -0.05263158 0.1785714 0.4356061 0.3640170 0.5446429 0.7368421    0
## BagCART -0.19469027 0.3525836 0.5944629 0.5299798 0.8199405 0.8672566    0
## RF       0.16666667 0.5616826 0.6596494 0.6634035 0.8629344 1.0000000    0
## GBM      0.32432432 0.5577508 0.8621997 0.7485045 0.9668142 1.0000000    0
dotplot(results)

cat('The average accuracy from all models is:',
    mean(c(results$values$`LR~Accuracy`,results$values$`CART~Accuracy`,results$values$`BagCART~Accuracy`,results$values$`RF~Accuracy`,results$values$`GBM~Accuracy`)))
## The average accuracy from all models is: 0.7808571

5. Improve Accuracy or Results

After we arrive at a short list of machine learning algorithms with a good level of accuracy, we can leverage several ways to improve their results.

Using the two best-performing algorithms from the previous section, we will search for a combination of parameters for each algorithm that yields the best results.

5.a) Algorithm Tuning

Finally, we will tune the best-performing algorithms from each group further and see whether we can get more accuracy out of them.

# Tuning algorithm #1 - Random Forest
if (!muteEmail) email_notify(paste("Algorithm #1 tuning has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@396a51ab}"
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(mtry=c(2,15,30,45,60))
fit.final1 <- train(targetVar~., data=xy_train, method="rf", metric=metricTarget, tuneGrid=grid, trControl=control)
print(fit.final1)
## Random Forest 
## 
## 146 samples
##  60 predictor
##   2 classes: 'M', 'R' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 131, 131, 131, 131, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8357143  0.6634035
##   31    0.7747619  0.5451204
##   60    0.7880952  0.5716670
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
proc.time()-startTimeModule
##    user  system elapsed 
##   6.170   0.010   6.181
if (!muteEmail) email_notify(paste("Algorithm #1 tuning completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@548a9f61}"
# Tuning algorithm #2 - Gradient Boosting
if (!muteEmail) email_notify(paste("Algorithm #2 tuning has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@1b604f19}"
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(nrounds=c(100,150,200,250,300), max_depth=3, eta=0.3, gamma=0, colsample_bytree=0.6, min_child_weight=1, subsample=1)
fit.final2 <- train(targetVar~., data=xy_train, method="xgbTree", metric=metricTarget, tuneGrid=grid, trControl=control, verbose=F)
plot(fit.final2)

print(fit.final2)
## eXtreme Gradient Boosting 
## 
## 146 samples
##  60 predictor
##   2 classes: 'M', 'R' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 131, 131, 131, 131, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   nrounds  Accuracy   Kappa    
##   100      0.8561905  0.7095273
##   150      0.8495238  0.6945820
##   200      0.8495238  0.6945820
##   250      0.8495238  0.6945820
##   300      0.8495238  0.6945820
## 
## Tuning parameter 'max_depth' was held constant at a value of 3
## Tuning parameter 'eta' was held constant at a value of 0.3
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning parameter 'colsample_bytree' was held constant at a value of 0.6
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Tuning parameter 'subsample' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 100, max_depth = 3,
##  eta = 0.3, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1
##  and subsample = 1.
proc.time()-startTimeModule
##    user  system elapsed 
##   2.460   0.050   1.614
if (!muteEmail) email_notify(paste("Algorithm #2 tuning completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@55f3ddb1}"

5.d) Compare Algorithms After Tuning

results <- resamples(list(RF=fit.final1, GBM=fit.final2))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: RF, GBM 
## Number of resamples: 10 
## 
## Accuracy 
##          Min.   1st Qu.    Median      Mean   3rd Qu. Max. NA's
## RF  0.6000000 0.7892857 0.8333333 0.8357143 0.9321429    1    0
## GBM 0.6428571 0.7500000 0.8619048 0.8561905 0.9821429    1    0
## 
## Kappa 
##          Min.   1st Qu.    Median      Mean   3rd Qu. Max. NA's
## RF  0.1666667 0.5616826 0.6596494 0.6634035 0.8629344    1    0
## GBM 0.2857143 0.4950033 0.7232143 0.7095273 0.9642857    1    0
dotplot(results)
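The dotplot compares the resampling distributions visually. Because both tuned models were evaluated on the same cross-validation folds, caret's diff() method for resamples objects can also test whether the difference between them is statistically meaningful; a short sketch:

```r
# Paired comparison of the resampling results (same CV folds for both models)
diffs <- diff(results)
summary(diffs)   # estimated accuracy/kappa differences between RF and GBM with p-values
```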

6. Finalize Model and Present Results

Once we have narrowed the field down to a model that we believe can make accurate predictions on unseen data, we are ready to finalize it. Finalizing a model may involve sub-tasks such as making predictions on the validation dataset, creating a standalone model on the entire training dataset, and saving the model for later use.

if (!muteEmail) email_notify(paste("Model Validation and Final Model Creation has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@46d56d67}"

6.a) Predictions on validation dataset

predictions <- predict(fit.final2, newdata=xy_test)
confusionMatrix(predictions, y_test)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  M  R
##          M 25  4
##          R  8 25
##                                           
##                Accuracy : 0.8065          
##                  95% CI : (0.6863, 0.8958)
##     No Information Rate : 0.5323          
##     P-Value [Acc > NIR] : 6.468e-06       
##                                           
##                   Kappa : 0.6145          
##                                           
##  Mcnemar's Test P-Value : 0.3865          
##                                           
##             Sensitivity : 0.7576          
##             Specificity : 0.8621          
##          Pos Pred Value : 0.8621          
##          Neg Pred Value : 0.7576          
##              Prevalence : 0.5323          
##          Detection Rate : 0.4032          
##    Detection Prevalence : 0.4677          
##       Balanced Accuracy : 0.8098          
##                                           
##        'Positive' Class : M               
## 
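The confusionMatrix() call returns an object whose components can also be pulled out programmatically, for example to log metrics or include them in the email notifications used throughout this template; cm is a hypothetical name:

```r
# Extract individual statistics from the caret confusionMatrix object
cm <- confusionMatrix(predictions, y_test)
cm$overall["Accuracy"]      # overall accuracy
cm$byClass["Sensitivity"]   # per-class statistics (sensitivity, specificity, ...)
cm$table                    # the raw confusion matrix counts
```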
# Note: converting hard class predictions to numeric yields a two-segment ROC curve
# through a single operating point; class probabilities would produce a full curve
pred <- prediction(as.numeric(predictions), as.numeric(y_test))
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, colorize=TRUE)

auc <- performance(pred, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.8098224
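A full ROC curve requires class probabilities rather than hard labels. The sketch below assumes the caret control object was set up with classProbs=TRUE earlier in the script (otherwise predict(..., type="prob") is unavailable); probs, pred_prob, and perf_prob are hypothetical names:

```r
# ROC curve from class probabilities instead of hard class predictions.
# probs has one column per class ('M' and 'R'); use the positive class 'M'.
probs <- predict(fit.final2, newdata=xy_test, type="prob")
pred_prob <- prediction(probs$M, ifelse(y_test == "M", 1, 0))
perf_prob <- performance(pred_prob, measure="tpr", x.measure="fpr")
plot(perf_prob, colorize=TRUE)
performance(pred_prob, measure="auc")@y.values[[1]]
```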

6.b) Create standalone model on entire training dataset

startTimeModule <- proc.time()
library(xgboost)
set.seed(seedNum)

# Combine the training and test datasets to recover the full dataset for training the final model
xy_complete <- rbind(xy_train, xy_test)
y_final <- xy_complete$targetVar
xy_complete$targetVar <- NULL
x_final <- as.matrix(xy_complete)

# Note: y_final is a factor, so xgboost() receives its integer codes (1/2) and falls back
# to its default regression objective (hence the train-rmse log below); a classification
# run would pass label=as.numeric(y_final)-1 with objective="binary:logistic". Also note
# that nrounds=200 here differs from the tuned value of nrounds=100 selected above.
finalModel <- xgboost(data=x_final, label=y_final, nrounds=200, max_depth=3, eta=0.3, gamma=0, colsample_bytree=0.6, min_child_weight=1, subsample=1)
## [1]  train-rmse:0.803984 
## [2]  train-rmse:0.600953 
## [3]  train-rmse:0.464395 
## ... (rounds 4-198 omitted; train-rmse decreases monotonically and plateaus at 0.000612 from round 173 onward)
## [199]    train-rmse:0.000612 
## [200]    train-rmse:0.000612
print(finalModel)
## ##### xgb.Booster
## raw: 107.4 Kb 
## call:
##   xgb.train(params = params, data = dtrain, nrounds = nrounds, 
##     watchlist = watchlist, verbose = verbose, print_every_n = print_every_n, 
##     early_stopping_rounds = early_stopping_rounds, maximize = maximize, 
##     save_period = save_period, save_name = save_name, xgb_model = xgb_model, 
##     callbacks = callbacks, max_depth = 3, eta = 0.3, gamma = 0, 
##     colsample_bytree = 0.6, min_child_weight = 1, subsample = 1)
## params (as set within xgb.train):
##   max_depth = "3", eta = "0.3", gamma = "0", colsample_bytree = "0.6", min_child_weight = "1", subsample = "1", silent = "1"
## xgb.attributes:
##   niter
## callbacks:
##   cb.print.evaluation(period = print_every_n)
##   cb.evaluation.log()
## # of features: 60 
## niter: 200
## nfeatures : 60 
## evaluation_log:
##     iter train_rmse
##        1   0.803984
##        2   0.600953
## ---                
##      199   0.000612
##      200   0.000612
proc.time()-startTimeModule
##    user  system elapsed 
##    0.38    0.00    0.21

6.c) Save model for later use

#saveRDS(finalModel, "./finalModel_BinaryClass.rds")
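If the saveRDS() call above is uncommented, the model can be restored in a later session with readRDS(). For xgb.Booster objects the xgboost package also provides its own xgb.save()/xgb.load() pair, which is generally the more robust serialization path across package versions; a minimal sketch, kept commented out like the save step above:

```r
# Restore the model saved with saveRDS() (path taken from the commented line above)
# finalModel <- readRDS("./finalModel_BinaryClass.rds")
# predict() on an xgb.Booster expects a numeric matrix with the same 60 feature columns
# preds <- predict(finalModel, newdata=x_final)

# Alternative: xgboost's native serialization ("finalModel_BinaryClass.model" is a
# hypothetical filename)
# xgb.save(finalModel, "finalModel_BinaryClass.model")
# loaded <- xgb.load("finalModel_BinaryClass.model")
```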
if (!muteEmail) email_notify(paste("Model Validation and Final Model Creation Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@3532ec19}"
proc.time()-startTimeScript
##    user  system elapsed 
##  88.510   1.280 100.183